Abstract

Artificial general intelligence aims to create agents capable of learning to
solve arbitrary interesting problems. We define two versions of asymptotic
optimality and prove that no agent can satisfy the strong version while in
some cases, depending on discounting, there does exist a non-computable
weak asymptotically optimal agent.
1 Introduction

The dream of artificial general intelligence is to create an agent that, starting with 
no knowledge of its environment, eventually learns to behave optimally. This means 
it should be able to learn chess just by playing, or Go, or how to drive a car or mow 
the lawn, or any task we could conceivably be interested in assigning it. 

Before considering the existence of universally intelligent agents, we must be 
precise about what is meant by optimality. If the environment and goal are known, 
then subject to computation issues, the optimal policy is easy to construct using 
an expectimax search from sequential decision theory [NR03]. However, if the true
environment is unknown then the agent will necessarily spend some time exploring, 
and so cannot immediately play according to the optimal policy. Given a class of 
environments, we suggest two definitions of asymptotic optimality for an agent. 



1. An agent is strongly asymptotically optimal if for every environment in the
class it plays optimally in the limit.

2. It is weakly asymptotically optimal if for every environment in the class it plays
optimally on average in the limit.

The key difference is that a strong asymptotically optimal agent must eventually 
stop exploring, while a weak asymptotically optimal agent may explore forever, but 
with decreasing frequency. 

In this paper we consider the (non-)existence of weak/strong asymptotically
optimal agents in the class of all deterministic computable environments. The
restriction to deterministic environments is for the sake of simplicity and because
the results for this case are already sufficiently non-trivial to be interesting. The
restriction to computable environments is more philosophical. The Church-Turing
thesis is the unprovable hypothesis that anything that can intuitively be computed
can also be computed by a Turing machine. Applying this to physics leads to the
strong Church-Turing thesis that the universe is computable (possibly stochastically
computable, i.e. computable when given access to an oracle of random noise). Having
made these assumptions, the largest interesting class then becomes the class of
computable (possibly stochastic) environments.

In [Hut04], Hutter conjectured that his universal Bayesian agent, AIXI, was
weakly asymptotically optimal in the class of all computable stochastic
environments. Unfortunately this was recently shown to be false in [Ors10], where it
is proven that no Bayesian agent (with a static prior) can be weakly asymptotically
optimal in this class.¹ The key idea behind Orseau's proof was to show that
AIXI eventually stops exploring. This is somewhat surprising because it is normally
assumed that Bayesian agents solve the exploration/exploitation dilemma in
a principled way. This result is a bit reminiscent of Bayesian (passive induction)
inconsistency results [DF86a, DF86b], although the details of the failure are very
different.

¹Or even the class of computable deterministic environments.

We extend the work of [Ors10], where only Bayesian agents are considered, to
show that non-computable weak asymptotically optimal agents do exist in the class
of deterministic computable environments for some discount functions (including
geometric), but not for others. We also show that no asymptotically optimal agent
can be computable, and that for all "reasonable" discount functions there does not
exist a strong asymptotically optimal agent.

The weak asymptotically optimal agent we construct is similar to AIXI, but
with an exploration component similar to ε-learning for finite state Markov decision
processes or the UCB algorithm for bandits. The key is to explore sufficiently
often and deeply to ensure that the environment used for the model is an adequate
approximation of the true environment. At the same time, the agent must explore
infrequently enough that it actually exploits its knowledge. Whether or not it is
possible to get this balance right depends, somewhat surprisingly, on how forward
looking the agent is (determined by the discount function). In particular, it is
sometimes not possible to explore enough to learn the true environment without
damaging even a weak form of asymptotic optimality.

Note that the exploration/exploitation problem is well-understood in the bandit
case [ACBF02, BF85] and for (finite-state stationary) Markov decision processes
[SL08]. In these restrictive settings, various satisfactory optimality criteria are
available. In this work, we make no assumptions such as the Markov property,
stationarity or ergodicity; we assume only that the environment is computable.
So far, no satisfactory optimality definition is available for this general case.

2 Notation and Definitions 

We use similar notation to [Hut04, Ors10] where the agent takes actions and the
environment returns an observation/reward pair.

Strings. A finite string $a$ over alphabet $\mathcal{A}$ is a finite sequence
$a_1 a_2 a_3 \cdots a_{n-1} a_n$ with $a_i \in \mathcal{A}$. An infinite string $\omega$ over
alphabet $\mathcal{A}$ is an infinite sequence $\omega_1 \omega_2 \omega_3 \cdots$.
$\mathcal{A}^n$, $\mathcal{A}^*$ and $\mathcal{A}^\infty$ are the sets of strings of length $n$,
strings of finite length, and infinite strings respectively. Let $x$ be a string (finite or
infinite) then substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where
$s, t \in \mathbb{N}$ and $s \le t$. Strings may be concatenated. Let $x, y \in \mathcal{A}^*$
of length $n$ and $m$ respectively, and $\omega \in \mathcal{A}^\infty$. Then define
$xy := x_1 x_2 \cdots x_{n-1} x_n y_1 y_2 \cdots y_{m-1} y_m$ and
$x\omega := x_1 x_2 \cdots x_{n-1} x_n \omega_1 \omega_2 \omega_3 \cdots$.
Some useful shorthands,

$$x_{<t} := x_{1:t-1} \qquad yx_{<t} := y_1 x_1 y_2 x_2 \cdots y_{t-1} x_{t-1}. \tag{1}$$

The second of these is ambiguous with concatenation, so wherever $yx_{<t}$ appears
we assume the interleaving definition of (1) is intended. For example, it will be
common to see $yx_{<t} y_t$, which represents the string
$y_1 x_1 y_2 x_2 y_3 x_3 \cdots y_{t-1} x_{t-1} y_t$. For binary strings, we write $\#1(a)$
and $\#0(a)$ to mean the number of 1's and number of 0's in $a$ respectively.

[Figure: interaction loop between the agent $\pi$ and the environment $\mu$; the agent
sends an action $y_1$ to the environment, which returns an observation/reward pair.]
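The interleaving shorthand and the counting notation can be made concrete with a small sketch. The type aliases and the function names `interleave` and `count_symbol` are illustrative assumptions, not notation from the paper.

```python
from typing import List, Sequence, Tuple

Action = int
Percept = Tuple[int, float]  # assumed encoding of x_t = (o_t, r_t)


def interleave(actions: Sequence[Action], percepts: Sequence[Percept]) -> List[object]:
    """Return y_1 x_1 y_2 x_2 ... as a flat list, i.e. the interleaving of shorthand (1)."""
    if len(actions) != len(percepts):
        raise ValueError("need exactly one percept per action")
    history: List[object] = []
    for y, x in zip(actions, percepts):
        history.extend([y, x])
    return history


def count_symbol(bits: Sequence[int], symbol: int) -> int:
    """#symbol(a): the number of occurrences of `symbol` in the binary string a."""
    return sum(1 for b in bits if b == symbol)
```

For example, `interleave([0, 1], [(0, 0.5), (0, 0.0)])` yields the history `[0, (0, 0.5), 1, (0, 0.0)]`.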

Environments and Optimality. Let $\mathcal{Y}$, $\mathcal{O}$ and
$\mathcal{R} \subset \mathbb{R}$ be action, observation and reward spaces
respectively. Let $\mathcal{X} = \mathcal{O} \times \mathcal{R}$. An agent interacts
with an environment as illustrated in the accompanying diagram. First, the agent
takes an action, upon which it receives a new observation/reward pair.
The agent then takes another action, receives another observation/reward pair, and
so on indefinitely. The goal of the agent is to maximise its discounted rewards
over time. In this paper we consider only deterministic environments where the
next observation/reward pair is determined by a function of the previous actions,
observations and rewards.

Definition 1 (Deterministic Environment). A deterministic environment $\mu$ is a
function $\mu : (\mathcal{Y} \times \mathcal{X})^* \times \mathcal{Y} \to \mathcal{X}$ where
$\mu(yx_{<t} y_t) \in \mathcal{X}$ is the observation/reward pair given after action $y_t$
is taken in history $yx_{<t}$. Wherever we write $x_t$ we implicitly assume
$x_t = (o_t, r_t)$ and refer to $o_t$ and $r_t$ without defining them. An environment
$\mu$ is computable if there exists a Turing machine that computes it.

Note that since environments are deterministic the next observation need not
depend on the previous observations (only actions). We choose to leave the
dependence in, as the proofs become clearer when both the action and observation
sequences are visible.

Assumption 2. $\mathcal{Y}$ and $\mathcal{O}$ are finite, $\mathcal{R} = [0, 1]$.

Definition 3 (Policy). A policy $\pi$ is a function from a history to an action,
$\pi : (\mathcal{Y} \times \mathcal{X})^* \to \mathcal{Y}$.

As expected, a policy $\pi$ and environment $\mu$ can interact with each other to
generate a play-out sequence of action/observation/reward tuples.

Definition 4 (Play-out Sequence). We define the play-out sequence
$yx^{\mu,\pi} \in (\mathcal{Y} \times \mathcal{X})^\infty$ inductively by
$y^{\mu,\pi}_k := \pi(yx^{\mu,\pi}_{<k})$ and $x^{\mu,\pi}_k := \mu(yx^{\mu,\pi}_{<k} y^{\mu,\pi}_k)$.

We need to define the value of a policy $\pi$ in environment $\mu$. To avoid the
possibility of infinite rewards, we will use discounted values. While it is common
to use only geometric discounting, we have two reasons to allow arbitrary
time-consistent discount functions.

1. Geometric discounting has a constant effective horizon, but we feel agents
should be allowed to use a discount function that leads to a growing horizon.
This is seen in other agents, such as humans, who generally become less myopic
as they grow older. See [FOO02] for an overview of generic discounting.

2. The existence of asymptotically optimal agents depends critically on the
effective horizon of the discount function.

Definition 5 (Discount Function). A regular discount function
$\gamma \in \mathbb{R}^\infty$ is a vector satisfying $\gamma_k \ge 0$ and
$0 < \sum_{t=k}^\infty \gamma_t < \infty$ for all $k \in \mathbb{N}$.

The first condition (non-negativity) is natural for any definition of a discount
function. The second condition (summability) is often cited as the purpose of a
discount function (to prevent infinite utilities), but economists sometimes use
non-summable discount functions, such as hyperbolic. The third condition (every tail
sum is positive) guarantees the agent cares about the infinite future, and is required
to make asymptotic analysis interesting. We only consider discount functions
satisfying all three conditions. In the following, let

$$\Gamma_t := \sum_{i=t}^\infty \gamma_i \qquad H_t(p) := \min_{h \in \mathbb{N}} \left\{ h : \frac{1}{\Gamma_t} \sum_{k=t}^{t+h} \gamma_k > p \right\}.$$

An infinite sequence of rewards starting at time $t$, $r_t, r_{t+1}, r_{t+2}, \cdots$ is given
a value of $\frac{1}{\Gamma_t} \sum_{i=t}^\infty \gamma_i r_i$. The term $\frac{1}{\Gamma_t}$
is a normalisation term to ensure that values scale in such a way that they can still
be compared in the limit. A discount function is computable if there exists a Turing
machine computing it. All well known discount functions, such as geometric, fixed
horizon and hyperbolic are computable. Note that $H_t(p)$ exists for all $p \in [0, 1)$
and represents the effective horizon of the agent. After $H_t(p)$ time-steps into the
future, starting at time $t$, the agent stands to gain/lose at most $1 - p$.
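As a rough illustration of the effective horizon, the sketch below computes $H_t(p)$ by direct summation, assuming $H_t(p)$ is the smallest $h$ for which the normalised partial sum $\frac{1}{\Gamma_t}\sum_{k=t}^{t+h}\gamma_k$ exceeds $p$. The function names, the choice $\gamma = 1/2$ for the geometric case, and the use of closed-form tail sums are assumptions made for the example. Exact rational arithmetic avoids rounding issues at the threshold.

```python
from fractions import Fraction
from typing import Callable


def effective_horizon(gamma: Callable[[int], Fraction],
                      tail: Callable[[int], Fraction],
                      t: int, p: Fraction) -> int:
    """Smallest h with (1 / Gamma_t) * sum_{k=t}^{t+h} gamma_k > p."""
    if not 0 <= p < 1:
        raise ValueError("p must lie in [0, 1)")
    gamma_t = tail(t)            # Gamma_t, the tail sum from time t onward
    partial = Fraction(0)
    h = 0
    while True:                  # terminates because the partial sums tend to Gamma_t
        partial += gamma(t + h)
        if partial > p * gamma_t:
            return h
        h += 1


HALF = Fraction(1, 2)            # assumed geometric discount rate for the example


def geometric(k: int) -> Fraction:
    return HALF ** k


def geometric_tail(t: int) -> Fraction:
    return HALF ** t / (1 - HALF)


def inverse_quadratic(k: int) -> Fraction:
    return Fraction(1, k * (k + 1))


def inverse_quadratic_tail(t: int) -> Fraction:
    return Fraction(1, t)        # telescoping sum of 1 / (k (k + 1))
```

Under these assumptions the geometric horizon does not depend on $t$, while for $\gamma_k = \frac{1}{k(k+1)}$ the horizon at $p = \frac12$ equals $t$, i.e. it grows linearly.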

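Definition 6 (Values and Optimal Policy). The value of policy $\pi$ when starting from
history $yx^{\mu,\pi}_{<t}$ in environment $\mu$ is
$V^\pi_\mu(yx^{\mu,\pi}_{<t}) := \frac{1}{\Gamma_t} \sum_{k=t}^\infty \gamma_k r^{\mu,\pi}_k$.
The optimal policy $\pi^*_\mu$ and its value $V^*_\mu$ are defined by
$\pi^*_\mu(yx_{<t}) := \arg\max_\pi V^\pi_\mu(yx_{<t})$ and
$V^*_\mu(yx_{<t}) := V^{\pi^*_\mu}_\mu(yx_{<t})$.

Assumption 2 combined with Theorem 6 in [LH11] guarantees the existence
of $\pi^*_\mu$. Note that the normalisation term $\frac{1}{\Gamma_t}$ does not change the
policy, but is used to ensure that values scale appropriately in the limit. For example,
when discounting geometrically we have $\gamma_t = \gamma^t$ for some $\gamma \in (0, 1)$
and so $\Gamma_t = \frac{\gamma^t}{1 - \gamma}$ and
$V^\pi_\mu(yx_{<t}) = (1 - \gamma) \sum_{k=t}^\infty \gamma^{k-t} r^{\mu,\pi}_k$.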
Definition 7 (Asymptotic Optimality). Let $\mathcal{M} = \{\mu_0, \mu_1, \cdots\}$ be a
finite or countable set of environments and $\gamma$ be a discount function. A policy
$\pi$ is a strong asymptotically optimal policy in $(\mathcal{M}, \gamma)$ if

$$\lim_{n \to \infty} \left[ V^*_\mu(yx^{\mu,\pi}_{<n}) - V^\pi_\mu(yx^{\mu,\pi}_{<n}) \right] = 0, \quad \text{for all } \mu \in \mathcal{M}. \tag{2}$$

It is a weak asymptotically optimal policy if

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \left[ V^*_\mu(yx^{\mu,\pi}_{<t}) - V^\pi_\mu(yx^{\mu,\pi}_{<t}) \right] = 0, \quad \text{for all } \mu \in \mathcal{M}. \tag{3}$$

Strong asymptotic optimality demands that the value of a single policy $\pi$
converges to the value of the optimal policy $\pi^*_\mu$ for all $\mu$ in the class. This
means that in the limit, a strong asymptotically optimal policy will obtain the maximum
value possible in that environment.

Weak asymptotic optimality is similar, but only requires the average value of the
policy $\pi$ to converge to the average value of the optimal policy. This means that a
weak asymptotically optimal policy can still make infinitely many bad mistakes, but
must do so for only a fraction of the time that converges to zero. Strong asymptotic
optimality implies weak asymptotic optimality.
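The gap between the two notions can be seen on a purely hypothetical sequence of value differences. The sketch below assumes a gap of 1 at every power of two and 0 elsewhere; this sequence is invented for illustration and is not derived from any particular environment. Its running average tends to zero (the weak criterion), while the sequence itself has no limit (so the strong criterion fails).

```python
def gap(t: int) -> float:
    """Hypothetical value gap V* - V^pi at time t >= 1: 1 at powers of two, else 0."""
    return 1.0 if (t & (t - 1)) == 0 else 0.0


def running_average(n: int) -> float:
    """(1 / n) * sum_{t=1}^{n} gap(t), the quantity bounded in the weak criterion."""
    return sum(gap(t) for t in range(1, n + 1)) / n
```

Only about $\log_2 n$ of the first $n$ time-steps carry a non-zero gap, so the average is of order $(\log n)/n$.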

While the definition of strong asymptotic optimality is rather natural, the
definition of weak asymptotic optimality appears somewhat more arbitrary. The purpose
of the average is to allow the agent to make a vanishing fraction of serious errors over
its (infinite) life-time. We believe this is a necessary condition for an agent to learn
the true environment. Of course, it would be possible to insist that the agent make
only $o(\log n)$ serious errors rather than $o(n)$, which would make a stronger version
of weak asymptotic optimality. Our choice is the weakest notion of optimality of
the above form that still makes sense, which turns out to be already too strong for
some discount rates.

Note that for both versions of optimality an agent would be considered optimal if 
it actively undertook a policy that led it to an extremely bad "hell" state from which 
it could not escape. Since the state cannot be escaped, its policy would then coincide 
with the optimal policy and so it would be considered optimal. Unfortunately, this 
problem seems to be an unavoidable consequence of learning algorithms in
non-ergodic environments in general, including the currently fashionable PAC algorithms
for arbitrary finite Markov decision processes.

3 Non-Existence of Asymptotically Optimal Policies

We present the negative theorem in three parts. The first shows that, at least for
computable discount functions, there does not exist a strong asymptotically optimal
policy. The second shows that any weak asymptotically optimal policy must be
incomputable while the third shows that there exist discount functions for which
even incomputable weak asymptotically optimal policies do not exist.

Theorem 8. Let $\mathcal{M}$ be the class of all deterministic computable environments
and $\gamma$ a computable discount function, then:

1. There does not exist a strong asymptotically optimal policy in $(\mathcal{M}, \gamma)$.

2. There does not exist a computable weak asymptotically optimal policy in
$(\mathcal{M}, \gamma)$.

3. If $\gamma_k := \frac{1}{k(k+1)}$ then there does not exist a weak asymptotically optimal
policy in $(\mathcal{M}, \gamma)$.

Part 1 of Theorem 8 says there is no strong asymptotically optimal policy in
the class of all computable deterministic environments when the discount function
is computable. It is likely there exist non-computable discount functions for which
there are strong asymptotically optimal policies. Unfortunately the discount
functions for which this is true are likely to be somewhat pathological and not realistic.

Given that strong asymptotic optimality is too strong, we should search for weak
asymptotically optimal policies. Part 2 of Theorem 8 shows that any such policy is
necessarily incomputable. This result features no real new ideas and relies on the
fact that you can use a computable policy to hand-craft a computable environment
in which it does very badly [Leg06]. In general this approach fails for incomputable
policies because the hand-crafted environment will then not be computable. Note
that this does not rule out the existence of a stochastically computable weak
asymptotically optimal policy.

It turns out that even weak asymptotic optimality is too strong for some discount
functions. Part 3 of Theorem 8 gives an example discount function for which no
such policy (computable or otherwise) exists. In the next section we introduce a
weak asymptotically optimal policy for geometric discounting (which may be extended
to other discount functions). Note that $\gamma_k = \frac{1}{k(k+1)}$ is an example of a
discount function where $H_t(p) = \Omega(t)$. It is also analytically easy to work with.

All negative results are proven by contradiction, and follow the same basic form.

1. Assume $\pi$ is a computable/arbitrary weak/strong asymptotically optimal policy.

2. Therefore $\pi$ is weak/strong asymptotically optimal in $\mu$ for some particular $\mu$.

3. Construct $\nu$, which is indistinguishable from $\mu$ under $\pi$, but where $\pi$ is not
weak/strong asymptotically optimal in $\nu$.

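Proof of Theorem 8, Part 1. Let $\mathcal{Y} = \{\text{up}, \text{down}\}$ and
$\mathcal{O} = \emptyset$. Now assume some policy $\pi$ is a strong asymptotically optimal
policy. Define an environment $\mu$ by,

$$\mu(yx_{<t} y_t) = \begin{cases} \frac12 & \text{if } y_t = \text{up} \\ 0 & \text{if } y_t = \text{down} \end{cases}$$

That is, $\mu(yx_{<t} y_t) \in \mathcal{R}$ is the reward given when taking action $y_t$ having
previously taken actions $y_{<t}$. Note that we have omitted the observations as
$\mathcal{O} = \emptyset$. It is easy to see that the optimal policy is
$\pi^*_\mu(yx_{<t}) = \text{up}$ for all $yx_{<t}$ with corresponding value
$V^*_\mu(yx_{<t}) = \frac12$. Since $\pi$ is strongly asymptotically optimal,

$$\lim_{t \to \infty} V^\pi_\mu(yx^{\mu,\pi}_{<t}) = \frac12. \tag{4}$$

Assume there exists a time-sequence $t_1, t_2, t_3, \cdots$ such that
$y^{\mu,\pi}_t = \text{down}$ (and hence $r^{\mu,\pi}_t = 0$) for all
$t \in \bigcup_{i=1}^\infty [t_i, t_i + H_{t_i}(\frac14)]$. Therefore by the definition of the
value function,

$$V^\pi_\mu(yx^{\mu,\pi}_{<t_i}) \le \frac{1}{2\Gamma_{t_i}} \sum_{k = t_i + H_{t_i}(\frac14) + 1}^\infty \gamma_k \tag{5}$$

$$= \frac12 \left[ 1 - \frac{1}{\Gamma_{t_i}} \sum_{k=t_i}^{t_i + H_{t_i}(\frac14)} \gamma_k \right] < \frac12 \left[ 1 - \frac14 \right] = \frac38 \tag{6}$$

where (5) follows from the definitions of the value function and $\Gamma$, and the
assumption in the previous line. (6) follows by algebra and the definition of
$H_{t_i}(\frac14)$. This contradicts (4). Therefore for any strong asymptotically optimal
policy $\pi$ there exists a $T \in \mathbb{N}$ such that for all $t \ge T$,
$y^{\mu,\pi}_s = \text{up}$ for some $s \in [t, t + H_t(\frac14)]$. I.e., $\pi$ cannot take the
sub-optimal action down too frequently. In particular, it cannot take action down for
large contiguous blocks of time. Construct a new environment $\nu$ defined by

$$\nu(yx_{<t} y_t) = \begin{cases}
\mu(yx_{<t} y_t) & \text{if } t \le T \\
\frac12 & \text{if } y_t = \text{up} \\
1 & \text{if } y_t = \text{down and there exists } t' \ge T \text{ such that } t' + H_{t'}(\tfrac14) < t \text{ and } y_s = \text{down } \forall s \in [t', t' + H_{t'}(\tfrac14)] \\
0 & \text{otherwise}
\end{cases}$$

Note that $\nu$ is computable if $H_t(\frac14)$ is and that by construction the play-out
sequences for $\mu$ and $\nu$ when using policy $\pi$ are identical. We now consider the
optimal policy in $\nu$. For any $t \ge T$ consider the value of policy $\tilde\pi$ defined by
$\tilde\pi(yx_{<t}) := \text{down}$ for all $yx_{<t}$.

$$V^{\tilde\pi}_\nu(yx^{\nu,\pi}_{<t}) \ge \frac{1}{\Gamma_t} \left[ \sum_{k = t + H_t(\frac14)}^\infty \gamma_k \right] \ge \frac34.$$

This is because $\tilde\pi$ spends $H_t(\frac14) - 1$ time-steps playing down and receiving
reward 0 before "unlocking" a reward of 1 on all subsequent plays. On the other hand,
$V^\pi_\nu(yx^{\nu,\pi}_{<t}) \le \frac12$ because $\pi$ can never unlock the reward of 1, since
it never plays down for a contiguous block of $H_t(\frac14)$ time-steps. By the definition
of the optimal policy, $V^{\tilde\pi}_\nu(yx_{<t}) \le V^*_\nu(yx_{<t})$. Therefore

$$V^*_\nu(yx^{\nu,\pi}_{<t}) - V^\pi_\nu(yx^{\nu,\pi}_{<t}) \ge \frac34 - \frac12 = \frac14.$$

Therefore

$$\limsup_{t \to \infty} \left[ V^*_\nu(yx^{\nu,\pi}_{<t}) - V^\pi_\nu(yx^{\nu,\pi}_{<t}) \right] \ge \frac14 \ne 0.$$

Therefore there does not exist a strong asymptotically optimal policy $\pi$ in
$(\mathcal{M}, \gamma)$. □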

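Proof of Theorem 8, Part 2. Let $\mathcal{Y} = \{\text{up}, \text{down}\}$ and
$\mathcal{O} = \emptyset$. Now let $\mathcal{M}$ be the class of all computable deterministic
environments and $\gamma$ be an arbitrary discount function. Suppose $\pi$ is computable
and consider the environment $\mu$ defined by

$$\mu(yx_{<t} y_t) = \begin{cases} 1 & \text{if } y_t \ne \pi(yx_{<t}) \\ 0 & \text{otherwise} \end{cases}$$

Since $\pi$ is computable, $\mu$ is as well. Therefore $\mu \in \mathcal{M}$. Now
$V^*_\mu(yx_{<t}) = 1$ for all $yx_{<t}$ while $V^\pi_\mu(yx^{\mu,\pi}_{<t}) = 0$. Therefore
$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n |V^*_\mu(yx^{\mu,\pi}_{<t}) - V^\pi_\mu(yx^{\mu,\pi}_{<t})| = 1$
and so $\pi$ is not weakly asymptotically optimal. □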
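The diagonal construction used in Part 2 can be sketched directly: given any policy implemented as a function, build the environment that pays 1 exactly when the agent deviates from that policy. The names `adversary_for`, `alternate`, `contrary` and `rewards_of` are assumptions introduced for this example, and observations are omitted.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]              # (action, reward) pairs
Policy = Callable[[History], str]
Environment = Callable[[History, str], float]


def adversary_for(policy: Policy) -> Environment:
    """Environment paying 1 whenever the action differs from policy(history)."""
    def env(history: History, action: str) -> float:
        return 0.0 if action == policy(history) else 1.0
    return env


def alternate(history: History) -> str:
    """Example computable policy: up, down, up, down, ..."""
    return "up" if len(history) % 2 == 0 else "down"


def contrary(history: History) -> str:
    """Policy that always does the opposite of `alternate`."""
    return "down" if alternate(history) == "up" else "up"


def rewards_of(policy: Policy, env: Environment, steps: int) -> List[float]:
    """Rewards along the play-out of `policy` in `env`."""
    history: History = []
    for _ in range(steps):
        action = policy(history)
        history.append((action, env(history, action)))
    return [r for _, r in history]
```

In the environment built from `alternate`, that policy collects reward 0 forever, whereas `contrary` collects reward 1 at every step. The construction only stays inside the class of computable environments because the policy being diagonalised against is itself computable.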



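Proof of Theorem 8, Part 3. Recall $\gamma_k = \frac{1}{k(k+1)}$ and so
$\Gamma_t = \frac{1}{t}$. Now let $\mathcal{Y} = \{\text{up}, \text{down}\}$ and
$\mathcal{O} = \emptyset$. Define $\mu$ by

$$\mu(yx_{<t} y_t) = \begin{cases} \frac12 & \text{if } y_t = \text{up} \\ \frac12 - \epsilon & \text{if } y_t = \text{down} \end{cases}$$

where $\epsilon \in (0, \frac12)$ will be chosen later. As before,
$V^*_\mu(yx_{<t}) = \frac12$. Assume $\pi$ is weakly asymptotically optimal. Therefore

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n V^\pi_\mu(yx^{\mu,\pi}_{<t}) = \frac12. \tag{7}$$

We show by contradiction that $\pi$ cannot explore (take action down) too often.
Assume there exists an infinite time-sequence $t_1, t_2, t_3, \cdots$ such that
$\pi(yx^{\mu,\pi}_{<t}) = \text{down}$ for all $t \in \bigcup_{i=1}^\infty [t_i, 2t_i]$. Then for
$t \in [t_i, \frac32 t_i]$ we have

$$V^\pi_\mu(yx^{\mu,\pi}_{<t}) \equiv t \sum_{k=t}^\infty \gamma_k r^{\mu,\pi}_k \le t \left[ \sum_{k=t}^{2t_i} \gamma_k \left( \frac12 - \epsilon \right) + \sum_{k=2t_i+1}^\infty \frac{\gamma_k}{2} \right] \tag{8}$$

$$= \frac12 - \epsilon \left[ 1 - \frac{t}{2t_i + 1} \right] < \frac12 - \frac{\epsilon}{4} \tag{9}$$

where (8) is the definition of the value function and the previous assumption and
definition of $\mu$. (9) follows by algebra and since $t \in [t_i, \frac32 t_i]$. Therefore

$$\frac{1}{2t_i} \sum_{t=1}^{2t_i} V^\pi_\mu(yx^{\mu,\pi}_{<t}) \le \frac{1}{2t_i} \left[ \sum_{t=1}^{2t_i} \frac12 - \sum_{t=t_i}^{\lfloor 3t_i/2 \rfloor} \frac{\epsilon}{4} \right] \le \frac12 - \frac{\epsilon}{16}. \tag{10}$$

The first inequality follows from (9) and because the maximum value of any play-out
sequence in $\mu$ is $\frac12$. The second follows by algebra. Therefore
$\liminf_{n \to \infty} \frac{1}{n} \sum_{t=1}^n V^\pi_\mu(yx^{\mu,\pi}_{<t}) \le \frac12 - \frac{1}{16}\epsilon < \frac12$,
which contradicts (7). Therefore there does not exist a time-sequence
$t_1 < t_2 < t_3 < \cdots$ such that $\pi(yx^{\mu,\pi}_{<t}) = \text{down}$ for all
$t \in \bigcup_{i=1}^\infty [t_i, 2t_i]$.

So far we have shown that $\pi$ cannot "explore" for $t$ consecutive time-steps
starting at time-step $t$, infinitely often. We now construct an environment similar to
$\mu$ where this is required. Choose $T$ to be larger than the last time-step $t$ at which
$y^{\mu,\pi}_s = \text{down}$ for all $s \in [t, 2t]$. Define $\nu$ by

$$\nu(yx_{<t} y_t) = \begin{cases}
\mu(yx_{<t} y_t) & \text{if } t \le T \\
\frac12 - \epsilon & \text{if } y_t = \text{down and there does not exist } t' \ge T \text{ such that } y_s = \text{down } \forall s \in [t', 2t'] \\
1 & \text{if } y_t = \text{down and there exists } t' \ge T \text{ such that } 2t' < t \text{ and } y_s = \text{down } \forall s \in [t', 2t'] \\
\frac12 & \text{otherwise}
\end{cases}$$

Now we compare the values in environment $\nu$ of $\pi$ and $\pi^*_\nu$ at times $t \ge T$.
Since $\pi$ does not take action down for $t$ consecutive time-steps at any time after
$T$, it never "unlocks" the reward of 1 and so $V^\pi_\nu(yx^{\nu,\pi}_{<t}) \le \frac12$. Now
let $\tilde\pi(yx_{<t}) = \text{down}$ for all $yx_{<t}$. Therefore, for $t \ge 2T$,

$$V^{\tilde\pi}_\nu(yx^{\nu,\pi}_{<t}) = t \sum_{k=t}^\infty \gamma_k r^{\nu,\tilde\pi}_k \ge t \left[ \sum_{k=t}^{2t-1} \gamma_k \left( \frac12 - \epsilon \right) + \sum_{k=2t}^\infty \gamma_k \right] \tag{11}$$

$$= t \left[ \left( \frac12 - \epsilon \right) \left( \frac{1}{t} - \frac{1}{2t} \right) + \frac{1}{2t} \right] = \frac34 - \frac{\epsilon}{2} \tag{12}$$

where (11) follows by the definition of $\nu$ and $\tilde\pi$. (12) follows by the definition
of $\gamma_k$ and algebra. Finally, setting $\epsilon = \frac14$ gives
$V^{\tilde\pi}_\nu(yx^{\nu,\pi}_{<t}) \ge \frac58 = \frac12 + \frac18$. Since
$V^*_\nu \ge V^{\tilde\pi}_\nu$, we get
$V^*_\nu(yx^{\nu,\pi}_{<t}) - V^\pi_\nu(yx^{\nu,\pi}_{<t}) \ge V^{\tilde\pi}_\nu(yx^{\nu,\pi}_{<t}) - V^\pi_\nu(yx^{\nu,\pi}_{<t}) \ge \frac18$.
Therefore
$\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \left[ V^*_\nu(yx^{\nu,\pi}_{<t}) - V^\pi_\nu(yx^{\nu,\pi}_{<t}) \right] \ge \frac18$,
and so $\pi$ is not weakly asymptotically optimal. □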
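The algebra behind the bound for the always-down policy can be checked with exact arithmetic. The sketch below assumes the discount $\gamma_k = \frac{1}{k(k+1)}$ (so $\Gamma_t = \frac1t$) and evaluates $t\left[(\frac12 - \epsilon)\sum_{k=t}^{2t-1}\gamma_k + \sum_{k \ge 2t}\gamma_k\right]$, which should equal $\frac34 - \frac{\epsilon}{2}$. The function names are illustrative.

```python
from fractions import Fraction


def gamma(k: int) -> Fraction:
    """Discount gamma_k = 1 / (k (k + 1))."""
    return Fraction(1, k * (k + 1))


def partial_sum(a: int, b: int) -> Fraction:
    """sum_{k=a}^{b} gamma_k computed term by term (telescopes to 1/a - 1/(b+1))."""
    return sum((gamma(k) for k in range(a, b + 1)), Fraction(0))


def always_down_lower_bound(t: int, eps: Fraction) -> Fraction:
    """t * [ (1/2 - eps) * sum_{k=t}^{2t-1} gamma_k + sum_{k>=2t} gamma_k ]."""
    locked = (Fraction(1, 2) - eps) * partial_sum(t, 2 * t - 1)
    unlocked = Fraction(1, 2 * t)      # tail sum Gamma_{2t} = 1 / (2t)
    return t * (locked + unlocked)
```

The result does not depend on $t$, which reflects the fact that the effective horizon of this discount function grows linearly with time.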



We believe it should be possible to generalise the above to computable discount
functions with $H_t(p) > c_p t$ with $c_p > 0$ for infinitely many $t$, but the proof will
likely be messy.

4 Existence of Weak Asymptotically Optimal Policies

In the previous section we showed there did not exist a strong asymptotically optimal
policy (for most discount functions) and that any weak asymptotically optimal policy
must be incomputable. In this section we show that a weak asymptotically optimal
policy exists for geometric discounting (and is, of course, incomputable).

The policy is reminiscent of ε-exploration in finite state MDPs (or UCB for
bandits) in that it spends most of its time exploiting the information it already
knows, while still exploring sufficiently often (and for sufficiently long) to detect any
significant errors in its model.

The idea will be to use a model-based policy that chooses its current model to be
the first environment in the model class (all computable deterministic environments)
consistent with the history seen so far. With increasing probability it takes the best
action according to this model, while still occasionally exploring randomly. When it
explores it always does so in bursts of increasing length.

Definition 9 (History Consistent). A deterministic environment $\mu$ is consistent
with history $yx_{<t}$ if $\mu(yx_{<k} y_k) = x_k$, for all $k < t$.
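History consistency and the choice of the first consistent model can be sketched for a finite, explicitly listed model class. This is only an illustration under strong assumptions: the class of all computable environments cannot be enumerated this way, percepts are reduced to rewards, and the three toy models and all function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Percept = float                                  # reward only; observations omitted
History = List[Tuple[int, Percept]]              # (action, percept) pairs
Environment = Callable[[History, int], Percept]


def is_consistent(env: Environment, history: History) -> bool:
    """True if env reproduces every percept in the history, given the same actions."""
    return all(env(history[:k], y) == x for k, (y, x) in enumerate(history))


def first_consistent(models: Sequence[Environment], history: History) -> Optional[int]:
    """Index of the first model consistent with the history, or None if there is none."""
    for i, env in enumerate(models):
        if is_consistent(env, history):
            return i
    return None


def pays_for_one(history: History, action: int) -> Percept:
    return 1.0 if action == 1 else 0.0


def pays_for_zero(history: History, action: int) -> Percept:
    return 1.0 if action == 0 else 0.0


def pays_for_one_late(history: History, action: int) -> Percept:
    """Pays for action 1 only from the third time-step onward."""
    return 1.0 if action == 1 and len(history) >= 2 else 0.0
```

As the history grows, the index returned by `first_consistent` can only increase, since a model that has been contradicted once stays contradicted.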

Definition 10 (Weak Asymptotically Optimal Policy). Let $\mathcal{Y} = \{0, 1\}$ and
$\mathcal{M} = \{\mu_1, \mu_2, \mu_3, \cdots\}$ be a countable class of deterministic
environments. Define a probability measure $P$ on $\mathbb{B}^\infty$ inductively by
$P(z_n = 1 \mid z_{<n}) := \frac{1}{n}$, for all $z_{<n} \in \mathbb{B}^{n-1}$.
Now let $\chi \in \mathbb{B}^\infty$ be sampled from $P$ and define
$\bar\chi, \bar\chi^h \in \mathbb{B}^\infty$ by

$$\bar\chi_k := \begin{cases} 1 & \text{if } k \in \bigcup_{i : \chi_i = 1} [i, i + \log i] \\ 0 & \text{otherwise} \end{cases} \qquad \bar\chi^h_k := \begin{cases} 0 & \text{if } \bar\chi_{k:k+h} = 0^{h+1} \\ 1 & \text{otherwise} \end{cases}$$

Next let $\psi$ be sampled from the uniform measure (each bit of $\psi$ is independently
sampled from a Bernoulli 1/2 distribution) and define a policy $\pi$ by,

$$\pi(yx_{<t}) := \begin{cases} \pi^*_{\nu_t}(yx_{<t}) & \text{if } \bar\chi_t = 0 \\ \psi_t & \text{otherwise} \end{cases} \tag{13}$$

where $\nu_t := \mu_{i_t}$ with $i_t = \min \{ i : \mu_i \text{ consistent with history } yx_{<t} \} < \infty$.
Note that $i_t$ is always finite because there exists an $i$ such that $\mu_i = \mu$, in
which case $\mu_i$ is necessarily consistent with $yx_{<t}$.

Intuitively, $\chi_k = 1$ at time-steps when the agent will explore for $\log k$ time-steps.
$\bar\chi_k = 1$ if the agent is exploring at time $k$ and $\psi_k$ is the action taken if
exploring at time-step $k$. $\bar\chi^h$ will be used later, with $\bar\chi^h_k = 1$ if the agent
will explore at least once in the interval $[k, k + h]$. If the agent is not exploring then it
acts according to the optimal policy for the first consistent environment in $\mathcal{M}$.
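The exploration schedule of Definition 10 can be sketched on a finite prefix. Several choices here are assumptions made for the example rather than part of the definition: the burst length is taken as the floor of the base-2 logarithm, sequences are stored 0-indexed (entry `i - 1` holds the value for time-step `i`), windows are truncated at the end of the prefix, and the function names are hypothetical.

```python
import math
import random
from typing import List


def sample_chi(n: int, rng: random.Random) -> List[int]:
    """chi_1 .. chi_n with P(chi_i = 1) = 1 / i."""
    return [1 if rng.random() < 1.0 / i else 0 for i in range(1, n + 1)]


def burst_indicator(chi: List[int]) -> List[int]:
    """bar-chi_k = 1 iff k lies in [i, i + log i] for some i with chi_i = 1."""
    n = len(chi)
    bar = [0] * n
    for i in range(1, n + 1):
        if chi[i - 1] == 1:
            length = int(math.log2(i))            # assumption: floor of base-2 log
            for k in range(i, min(n, i + length) + 1):
                bar[k - 1] = 1
    return bar


def window_indicator(bar: List[int], h: int) -> List[int]:
    """bar-chi^h_k = 1 iff some time-step in [k, k + h] is an exploration step."""
    n = len(bar)
    return [1 if any(bar[k:min(n, k + h + 1)]) else 0 for k in range(n)]
```

For instance, a single 1 in `chi` at time-step 8 produces an exploration burst covering time-steps 8 to 11 under the base-2 assumption.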

Theorem 11. Let $\gamma_t = \gamma^t$ with $\gamma \in (0, 1)$ (geometric discounting) then
the policy defined in Definition 10 is weakly asymptotically optimal in the class of all
deterministic computable environments with probability 1.

Some remarks:

1. That $\mathcal{Y} = \{0, 1\}$ is only a convenience, rather than a necessity. The policy
is easily generalised to arbitrary finite $\mathcal{Y}$.

2. $\pi$ is essentially a stochastic policy. With some technical difficulties it is possible
to construct an equivalent deterministic policy. This is done by choosing $\chi$
to be any $P$-Martin-Löf random sequence and $\psi$ to be a sequence that is
Martin-Löf random w.r.t. the uniform measure. The theorem then holds for
all deterministic environments. The proof is somewhat delicate and may not
extend nicely to stochastic environments. For an introduction to Kolmogorov
complexity and Martin-Löf randomness, see [LV08]. For a reason why the
stochastic case may not go through as easily, see [HM07].

3. The policy defined in Definition 10 is not computable for two reasons. First,
because it relies on the stochastic sequences $\chi$ and $\psi$. Second, because the
operation of finding the first environment consistent with the history is not
computable.² We do not know if there exists a weak asymptotically optimal
policy that is computable when given access to a random number generator
(or if it is given $\chi$ and $\psi$).

4. The bursts of exploration are required for optimality. Without them it will be
possible to construct counter-example environments similar to those used in
part 3 of Theorem 8.

Before the proof we require some more definitions and lemmas. Easier proofs are 
omitted. 

Definition 12 ($h$-Difference). Let $\mu$ and $\nu$ be two environments consistent with
history $yx_{<t}$, then $\mu$ is $h$-different to $\nu$ if there exists $yx_{t:t+h}$ satisfying

$y_k = \pi^*_\mu(yx_{<k})$ for all $k \in [t, t + h]$,
$x_k = \mu(yx_{<k} y_k)$ for all $k \in [t, t + h]$,
$x_k \ne \nu(yx_{<k} y_k)$ for some $k \in [t, t + h]$.

Intuitively, $\mu$ is $h$-different to $\nu$ at history $yx_{<t}$ if playing the optimal policy for
$\mu$ for $h$ time-steps makes $\nu$ inconsistent with the new history. Note that
$h$-difference is not symmetric.

Lemma 13. If $a_n \in [0, 1]$ and $\limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^n a_i = \epsilon$ and
$\alpha \in \mathbb{B}^\infty$ is an indicator sequence with
$\alpha_i := [\![ a_i \ge \epsilon/4 ]\!]$,³ then
$\prod_{i=1}^\infty \left[ 1 - \frac{\alpha_i}{i} \right] = 0$.

See the appendix for the proof.

Lemma 14. Let $a_1, a_2, a_3, \cdots$ be a sequence with $a_n \in [0, 1]$ for all $n$. The
following properties of $\chi$ are true with probability 1.

1. For any $h$, $\limsup_{n \to \infty} \frac{1}{n} \#1(\bar\chi^h_{1:n}) = 0$.

2. If $\limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^n a_i = \epsilon > 0$ and
$\alpha_i := [\![ a_i \ge \epsilon/2 ]\!]$ then $\alpha_i = \chi_i = 1$ for infinitely many $i$.

²The class of computable environments is not recursively enumerable [LV08].
³$[\![ expression ]\!] = 1$ if $expression$ is true and 0 otherwise.

Proof. 1. Let $i \in \mathbb{N}$, $\epsilon > 0$ and $E^h_i$ be the event that
$\#1(\bar\chi^h_{1:2^i}) > 2^i \epsilon$. Using the definition of $\bar\chi^h$ to compute the
expectation $\mathbf{E}[\#1(\bar\chi^h_{1:2^i})] < i(i+1)h$ and applying the Markov
inequality gives that $P(E^h_i) < i(i+1)h 2^{-i}/\epsilon$. Therefore
$\sum_{i \in \mathbb{N}} P(E^h_i) < \infty$. Therefore the Borel-Cantelli lemma gives that
$E^h_i$ occurs for only finitely many $i$ with probability 1. We now assume that
$\limsup_{n \to \infty} \frac{1}{n} \#1(\bar\chi^h_{1:n}) > 2\epsilon > 0$ and show that $E^h_i$
must occur infinitely often. By the definition of $\limsup$ and our assumption we have
that there exists a sequence $n_1, n_2, \cdots$ such that
$\#1(\bar\chi^h_{1:n_i}) > 2 n_i \epsilon$ for all $i \in \mathbb{N}$. Let
$n^+ := \min_{k \in \mathbb{N}} \{ 2^k : 2^k \ge n \}$ and note that
$\#1(\bar\chi^h_{1:n_i^+}) > n_i^+ \epsilon$, which is exactly $E^h_{\log n_i^+}$. Therefore there
exist infinitely many $i$ such that $E^h_i$ occurs and so
$\limsup_{n \to \infty} \frac{1}{n} \#1(\bar\chi^h_{1:n}) = 0$ with probability 1.

2. The probability that $\alpha_i = 1 \implies \chi_i = 0$ for all $i \ge T$ is
$P(\alpha_i = 1 \implies \chi_i = 0 \ \forall i \ge T) = \prod_{i=T}^\infty \left( 1 - \frac{\alpha_i}{i} \right) = 0$
by Lemma 13. Therefore the probability that $\alpha_i = \chi_i = 1$ for only finitely many
$i$ is zero. Therefore there exist infinitely many $i$ with $\alpha_i = \chi_i = 1$ with
probability 1, as required. □
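A crude numerical look at why exploration takes up a vanishing fraction of time: each time-step $i$ starts a burst with probability $\frac1i$, and a burst covers about $\log i + 1$ time-steps, so by a union bound the expected number of exploring time-steps up to $n$ is at most $\sum_{i \le n} \frac{\log i + 1}{i}$. The sketch below evaluates this bound, assuming base-2 logarithms rounded down. It bounds $\bar\chi$ rather than $\bar\chi^h$ and is not the bound used in the proof above; the function name is hypothetical.

```python
import math


def expected_exploration_bound(n: int) -> float:
    """Upper bound on E[#1(bar-chi_{1:n})] via a union bound over burst starts.

    A burst starting at time-step i (probability 1 / i) covers
    floor(log2 i) + 1 time-steps under the base-2 assumption.
    """
    return sum((int(math.log2(i)) + 1) / i for i in range(1, n + 1))
```

The bound grows roughly like $(\log n)^2$, so dividing by $n$ gives a quantity that shrinks towards zero as $n$ grows.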

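Lemma 15 (Approximation Lemma). Let $\pi_1$ and $\pi_2$ be policies, $\mu$ an
environment and $h \ge H_t(1 - \epsilon)$. Let $yx_{<t}$ be an arbitrary history and
$yx^{\mu,\pi_i}_{t:\infty}$ be the future action/observation/reward triples when playing
policy $\pi_i$. If $yx^{\mu,\pi_1}_{t:t+h} = yx^{\mu,\pi_2}_{t:t+h}$ then
$|V^{\pi_1}_\mu(yx_{<t}) - V^{\pi_2}_\mu(yx_{<t})| < \epsilon$.

Proof. By the definition of the value function,

$$|V^{\pi_1}_\mu(yx_{<t}) - V^{\pi_2}_\mu(yx_{<t})| \le \frac{1}{\Gamma_t} \sum_{i=t}^\infty \gamma_i \left| r^{\mu,\pi_1}_i - r^{\mu,\pi_2}_i \right| \tag{14}$$

$$= \frac{1}{\Gamma_t} \sum_{i=t+h+1}^\infty \gamma_i \left| r^{\mu,\pi_1}_i - r^{\mu,\pi_2}_i \right| \le \frac{1}{\Gamma_t} \sum_{i=t+h+1}^\infty \gamma_i < \epsilon \tag{15}$$

(14) follows from the definition of the value function. (15) follows since
$r^{\mu,\pi_1}_i = r^{\mu,\pi_2}_i$ for $i \in [t, t + h]$, rewards are bounded in $[0, 1]$ and by
the definition of $h := H_t(1 - \epsilon)$ (Definition 5). □

Recall that $\pi^*_\mu$ and $\pi^*_\nu$ are the optimal policies in environments $\mu$ and
$\nu$ respectively (see Definition 6).

Lemma 16 ($h$-difference). If
$|V^{\pi^*_\mu}_\mu(yx_{<t}) - V^{\pi^*_\nu}_\mu(yx_{<t})| > \epsilon$ then $\mu$ is
$H_t(1 - \epsilon)$-different to $\nu$ on $yx_{<t}$.

Proof. Follows from the approximation lemma. □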

We are now ready to prove the main theorem. 
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Proof of Theorem 11. Let tt be the policy defined in Definition 10 and /i 
be the true (unknown) environment. Recall that Vt = f^u with it = 
min {i : fii consistent with history yx^j^} is the first model consistent with the his- 
tory yjc^t at time t and is used by n when not exploring. First we claim there exists 
a T G N and environment u such that Vt = ^ for all t >T. Two facts, 

1. If /ij is inconsistent with history |/r<f then it is also inconsistent with i/r^'/^/j 
for all h eN. 

2. yU is consistent with yic^t for all tt, t. 

By 1) we have that the sequence ii, i2, is,- ' ' is monotone increasing. By 2) we have 
that the sequence is bounded by i with /ij = fi. The claim follows since any bounded 
monotone sequence of natural numbers converges in finite time. Let z/ := z/qo be the 
environment to which z/i, z/2, z/3, ■ ■ ■ converges to. Note that u must be consistent 
with history yx'^t for all t. We now show by contradiction that the optimal policy 
for u is weakly asymptotically optimal in environment /x. Suppose it were not, then 

1 " 
hm sup - J2 [V; (l^<f ) - V< (t/r-f )] = e > 0. (16) 

Let a G B°° be defined by a^ := 1 if and only if, 

[v;{yr-S)-V;{yr%n]>e/4. (17) 

By Lemma 14 there exists (with probability one) an infinite sequence ti,t2,t3, ■ ■ ■ 
for which Xk = ctk = I. Intuitively we should view time-step tk as the start 
of an "exploration" phase where the agent explores for logt^ time-steps. Let 
h := Ht^{l — e/4) = [log(e/4)/log7], which importantly is independent of tk (for 
geometric discounting). Since logt^ — )■ c>o we will assume that logt^ > h for all 
tk- Therefore Xi = 1 for all i G [j'kLi[tk,tk + h]. Therefore by the definition of 
IT, 7r(yr^'f) = ipi for i G [JkLi[tk,tk + h]. By Lemma 16 and Equation (17), /i is 
/i-different to u on history yx^^^. This means that if there exists a k such that tt 
plays according to the optimal policy for /i on all time-steps t G [tk,tk + h] then 
z/ will be inconsistent with the history yic'^'.^ _^_f^, which is a contradiction. We now 
show that vr does indeed play according to the optimal policy for n for all time- 
steps t G [tk,tk + h] for at least one k. Formally, we show the following holds with 
probability 1 for some k. 

ipi = vr(yr^f) = 7r*(yr^f), for all i G [tk,tk + h]. (18) 

Recall that ip G B°° where ipi & B is identically independently distributed according 
to a Bernoulli(i) distribution. Therefore P{ipi = '?r^(|^<f )) = |- Therefore p := 

P{i,, = 7r;(yr^f)V. G [tk,tk + h]) =^n-=t^(f. = ^;(l/^<f)) = 2-'-' > and 
P(yk3i G [tk,tk + h] with ipi 7^ '7r^(|^<f)) = Ilfclill ~ P) = 0- Therefore Equation 
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(18) is satisfied for some k with probability 1 and so Equation (16) leads to a 
contradiction. Therefore 

n 

li^ - E [^;(^<f) - ^;^(^<f)] = 0. (19) 

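We have shown that the optimal policy for $\nu$ has similar $\mu$-values to the optimal
policy for $\mu$. We now show that $\pi$ acts according to $\pi^{*}_{\nu}$ sufficiently often that it too
has values close to those of the optimum policy for the true environment, $\mu$. Let
$\epsilon > 0$, $h := H_t(1 - \epsilon)$ and $t \geq T$. If $\dot\chi^{h}_t = 0$ then by the definition of $\pi$ and the
approximation lemma we obtain

$$\left| V^{\pi}_{\mu}(yx^{\pi}_{<t}) - V^{\pi^{*}_{\nu}}_{\mu}(yx^{\pi}_{<t}) \right| < \epsilon. \tag{20}$$

Therefore

$$\limsup_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left| V^{\pi^{*}_{\nu}}_{\mu}(yx^{\pi}_{<t}) - V^{\pi}_{\mu}(yx^{\pi}_{<t}) \right| \leq \limsup_{n\to\infty} \frac{1}{n} \left[ \sum_{t=1}^{T-1} 1 + \sum_{t=T}^{n} \left[ \dot\chi^{h}_t (1 - \epsilon) + \epsilon \right] \right] \tag{21}$$

$$= \epsilon + (1 - \epsilon) \limsup_{n\to\infty} \frac{1}{n} \#1(\dot\chi^{h}_{1:n}) \tag{22}$$

$$= \epsilon \tag{23}$$

where (21) follows since values are bounded in [0, 1] and Equation (20). (22) follows
by algebra. (23) by part 1 of Lemma 14. By sending $\epsilon \to 0$,

$$\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left[ V^{\pi^{*}_{\nu}}_{\mu}(yx^{\pi}_{<t}) - V^{\pi}_{\mu}(yx^{\pi}_{<t}) \right] = 0. \tag{24}$$

Finally, combining Equations (19) and (24) gives the result. $\Box$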
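The last step of the proof relies on exploration occupying a vanishing fraction of time. The sketch below illustrates this under assumptions that are ours and are not stated in this section: a burst starts at time-step $t$ with probability $1/t$ and lasts $\lceil \log t \rceil + 1$ steps. It computes an upper bound on the expected exploring fraction, counting overlapping bursts twice.

```python
import math


def expected_exploring_fraction(n: int) -> float:
    """Upper bound on the expected fraction of the first n steps spent exploring.

    Assumptions (ours, for illustration only): a burst starts at step t with
    probability 1/t and lasts ceil(log t) + 1 steps. Overlapping bursts are
    counted more than once, so the result over-estimates the true fraction.
    """
    total = 0.0
    for t in range(1, n + 1):
        burst_length = math.ceil(math.log(t)) + 1
        total += burst_length / t
    return total / n


if __name__ == "__main__":
    for n in (10**2, 10**3, 10**4, 10**5):
        print(n, expected_exploring_fraction(n))
```

Under these assumptions the bound grows like $(\log n)^2 / n$, so it tends to zero even though exploration never stops.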

We expect this theorem to generalise without great difficulty to discount
functions satisfying $H_t(p) < c_p \log(t)$ for all $p$. There will be two key changes. First,
extend the exploration time to some function $E(t)$ with $E(t) \in O(H_t(p))$ for all
$p$. Second, modify the probability of exploration to ensure that Lemma 14 remains
true.



5 Discussion 

Summary. Part 1 of Theorem 8 shows that no policy can be strongly asymptoti- 
cally optimal for any computable discount function. The key insight is that strong 
asymptotic optimality essentially implies exploration must eventually cease. Once 
this occurs, the environment can change without the agent discovering the difference 
and the policy will no longer be optimal.

A weaker notion of asymptotic optimality, that a policy be optimal on average
in the limit, turns out to be more interesting. Part 2 of Theorem 8 shows that no 
weak asymptotically optimal policy can be computable. We should not be surprised 
by this result. Any computable policy can be used to construct a computable en- 
vironment in which that policy does very badly. Note that by computable here we 
mean deterministic and computable. There may be computable stochastic policies 
that are weakly asymptotically optimal, but we feel this is unlikely. 
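The diagonalisation idea in this paragraph can be sketched in a few lines. This is an illustration under our own simplifications (binary actions, history recorded as action/reward pairs) and is not necessarily the construction used in the proof of Theorem 8. All names are hypothetical.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, int]]    # (action, reward) pairs seen so far
Policy = Callable[[History], int]  # deterministic map: history -> action in {0, 1}


def adversarial_reward(target: Policy, history: History, action: int) -> int:
    """Reward 1 exactly when the action differs from what `target` would play."""
    return 0 if action == target(history) else 1


def run(agent: Policy, target: Policy, steps: int) -> int:
    """Total reward earned by `agent` in the environment built against `target`."""
    history: History = []
    total = 0
    for _ in range(steps):
        action = agent(history)
        reward = adversarial_reward(target, history, action)
        history.append((action, reward))
        total += reward
    return total


def alternating(history: History) -> int:
    """An example computable policy: alternate between actions 0 and 1."""
    return len(history) % 2


def contrarian(history: History) -> int:
    """Plays the opposite of `alternating` at every step."""
    return 1 - alternating(history)
```

In the environment built against `alternating`, that policy earns nothing while `contrarian` earns the maximum reward at every step. The environment is computable whenever the target policy is.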

Part 3 of Theorem 8 shows that even weak asymptotically optimal policies need
not exist if the discount function is sufficiently far-sighted. On the other hand,
Theorem 11 shows that weak asymptotically optimal policies do exist for some
discount rates, in particular, for the default geometric discounting. This non-trivial
and slightly surprising result shows that the choice of discount function is crucial to the
existence of weak asymptotically optimal policies. Where weak asymptotically optimal
policies do exist, they must explore infinitely often, in contiguous bursts of
increasing length, where the length of each burst depends on the discount
function.

Consequences. It would appear that Theorem 8 is problematic for artificial general 
intelligence. We cannot construct incomputable policies, and so we cannot construct 
weak asymptotically optimal policies. However, this is not as problematic as it may 
seem. There are a number of reasonable counter-arguments:

1. We may be able to make stochastically computable policies that are asymp- 
totically optimal. If the existence of true random noise is assumed then this 
would be a good solution. 

2. The counter-example environment constructed in part 2 of Theorem 8 is a sin- 
gle environment roughly as complex as the policy itself. Certainly, if the world 
were adversarial this would be a problem, but in general this appears not to 
be the case. On the other hand, if the environment is a learning agent itself, 
this could result in a complexity arms race without bound. There may ex- 
ist a computable weak asymptotically optimal policy in some extremely large 
class of environments. For example, the algorithm of Section 4 is stochasti- 
cally computable when the class of environments is recursively enumerable and 
contains only computable environments. A natural (and already quite large) 
class satisfying these properties is finite-state Markov decision processes with
{0, 1}-valued transition functions and rational-valued rewards.

3. While it is mathematically pleasant to use asymptotic behaviour to
characterise optimal general intelligent behaviour, in practice we usually care about
more immediate behaviour. We expect that results, and even (parameter-free)
formal definitions of intelligence, satisfying this need will be challenging, but
worthwhile.

4. Accept that even weak asymptotic optimality is too strong and find something 
weaker, but still useful. 
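The class mentioned in item 2 can be sketched as a small data structure. This is a minimal illustration under our own assumptions: states are hidden, only rewards are observed, and the agent tracks the first environment in an enumeration that is consistent with the history so far. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, Iterable, List, Optional, Tuple

Step = Tuple[int, Fraction]  # (action taken, reward observed) at one time-step


@dataclass
class DeterministicMDP:
    """Finite-state MDP with {0, 1}-valued transitions and rational rewards."""

    start: int
    next_state: Dict[Tuple[int, int], int]   # (state, action) -> next state
    reward: Dict[Tuple[int, int], Fraction]  # (state, action) -> reward

    def consistent_with(self, history: List[Step]) -> bool:
        """True if replaying the actions reproduces every observed reward."""
        state = self.start
        for action, observed in history:
            if self.reward[(state, action)] != observed:
                return False
            state = self.next_state[(state, action)]
        return True


def first_consistent(
    envs: Iterable[DeterministicMDP], history: List[Step]
) -> Optional[DeterministicMDP]:
    """Return the first enumerated environment that agrees with the history."""
    for env in envs:
        if env.consistent_with(history):
            return env
    return None
```

Because rewards are exact rationals, the consistency check is decidable, which is what makes an enumeration of this class usable by a (stochastically) computable policy.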


Relation to AIXI. The policy defined in Section 4 is not equivalent to AIXI
[Hut04], which is also incomputable. However, if the computable environments in
$\mathcal{M}$ are ordered by complexity then it is likely the two will be quite similar. The
key difference is that the policy defined in this paper will continue to explore whereas it
was shown in [Ors10] that AIXI eventually ceases exploration in some environments
and some histories. We believe, and a proof should not be too hard, that AIXI
will become weakly asymptotically optimal if an exploration component similar to
that of Section 4 is added.

We now briefly compare the self-optimising property in [Hut02] to strong
asymptotic optimality. A policy $\pi$ is self-optimising in a class $\mathcal{M}$ if
$\lim_{t\to\infty} \left[ V^{*}_{\mu}(yx_{<t}) - V^{\pi}_{\mu}(yx_{<t}) \right] = 0$ for any infinite history $yx_{1:\infty}$ and for all $\mu \in \mathcal{M}$.
This is similar to strong asymptotic optimality, but convergence must be on all
histories, rather than the histories actually generated by $\pi$. This makes the self-optimising
property a substantially stronger form of optimality than strong asymptotic
optimality. It has been proven that if there exists a self-optimising policy for a particular
class, then AIXI is also self-optimising in that class [Hut02].

It is possible to define a weak version of the self-optimising property by insisting
that $\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} \left[ V^{*}_{\mu}(yx_{<t}) - V^{\pi}_{\mu}(yx_{<t}) \right] = 0$ for all $yx_{1:\infty}$ and all $\mu \in \mathcal{M}$. It can
then be proven that the existence of a weak self-optimising policy would imply that
AIXI were also weakly self-optimising. However, the policy defined in Section 4
cannot be modified to have the weak self-optimising property. It must be allowed
to choose its actions itself. This is consistent with the work in [Ors10] which shows
that AIXI cannot be weakly asymptotically optimal, and so cannot be weakly
self-optimising either.

Discounting. Throughout this paper we have assumed rewards to be discounted
according to a summable discount function. A very natural alternative to discounting,
suggested in [LH07], is to restrict interest to environments satisfying $\sum_{k=1}^{\infty} r_k \leq 1$.
Now the goal of the agent is simply to maximise summed rewards. In this setting
it is easy to see that the positive theorem is lost while all negative ones still hold! 
This is unfortunate, as discounting presents a major philosophical challenge. How 
to choose a discount function? 

Assumptions/Limitations. Assumption 2 ensures that $\mathcal{Y}$ and $\mathcal{O}$ are finite. All
negative results go through for countable $\mathcal{Y}$ and $\mathcal{O}$. The optimal policy of Section
4 may not generalise to countable $\mathcal{Y}$. We have also assumed bounded reward and
discrete time. The first seems reasonable while the second allows for substantially
easier analysis. Additionally we have only considered deterministic computable
environments. The stochastic case is unquestionably interesting. We invoke Church's
thesis to assert that computable stochastic environments are essentially the largest
class of interesting environments.

Many of our theorems are only applicable to computable discount functions.
All well-known discount functions in use today are computable. However, [Hut04]
has suggested that $\gamma_t = 2^{-K(t)}$, where $K(t)$ is the (incomputable) prefix Kolmogorov
complexity of $t$, may have nice theoretical properties.

Open Questions. A number of open questions have arisen during this research. 
In particular, 

1. Prove Theorem 8 for a larger class of discount functions. 

2. Prove or disprove the existence of a weak asymptotically optimal stochastically 
computable policy for some discount function in the class of deterministic 
computable environments. 

3. Extend the policy of Section 4 to the larger class of stochastically computable
environments. We believe this to be possible along the lines of [RH08], but 
inevitably the analysis will be messy and complex. 

4. Extend Part 3 of Theorem 8 and Theorem 11 to a complete classification of
discount functions according to whether or not they admit a weak asymptotically
optimal policy in the class of computable environments.

5. Prove that AIXI is weakly asymptotically optimal when augmented with an 
exploration component as in Section 4. 

6. Define and study other formal measures of optimality/intelligence. 

Acknowledgements. We thank Laurent Orseau, Wen Shao and reviewers for

A Technical Proofs

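Lemma 17. Let $A = \{a_1, a_2, \cdots, a_n\}$ with $a \in [0, 1]$ for all $a \in A$. If $\frac{1}{n} \sum_{a \in A} a \geq \epsilon$
then

$$\left| \left\{ a \in A : a \geq \frac{\epsilon}{2} \right\} \right| \geq n \frac{\epsilon}{2}.$$

Proof. Let $A_{\geq} := \{a \in A : a \geq \frac{\epsilon}{2}\}$ and $A_{<} := A - A_{\geq}$. Therefore

$$n\epsilon \leq \sum_{a \in A} a = \sum_{a \in A_{<}} a + \sum_{a \in A_{\geq}} a \leq |A_{<}| \frac{\epsilon}{2} + |A_{\geq}|.$$

By rearranging and algebra, $|\{a \in A : a \geq \frac{\epsilon}{2}\}| = |A_{\geq}| \geq n \frac{\epsilon}{2}$ as required. $\Box$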
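Lemma 17 is a simple counting statement and is easy to sanity-check. The sketch below uses exact rational arithmetic to avoid rounding issues. The function names are ours, and the random test merely exercises the statement rather than proving it.

```python
import random
from fractions import Fraction
from typing import List


def lemma17_holds(values: List[Fraction], epsilon: Fraction) -> bool:
    """If mean(values) >= epsilon, check |{a : a >= epsilon/2}| >= n * epsilon / 2."""
    n = len(values)
    if sum(values) < epsilon * n:
        return True  # hypothesis of the lemma is not met, nothing to check
    large = sum(1 for a in values if a >= epsilon / 2)
    return large >= n * epsilon / 2


def random_check(trials: int, seed: int) -> bool:
    """Try the lemma on random inputs with values in [0, 1]."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 30)
        values = [Fraction(rng.randint(0, 100), 100) for _ in range(n)]
        epsilon = sum(values) / n  # the largest epsilon meeting the hypothesis
        if not lemma17_holds(values, epsilon):
            return False
    return True
```

Choosing `epsilon` equal to the mean is the tightest case, since any smaller value only weakens the conclusion.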
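
Proof of Lemma 13. First,

$$\prod_{i=1}^{\infty} \left( 1 - \frac{a_i}{i} \right) \leq \exp\left[ -\sum_{i=1}^{\infty} \frac{a_i}{i} \right]. \tag{25}$$

Equation (25) follows since $1 - a \leq \exp(-a)$ for all $a$.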

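Now since $\limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} a_i = \epsilon$, we have for any $N$ there exists an
$n > N$ such that $\frac{1}{n} \sum_{i=1}^{n} a_i \geq \frac{\epsilon}{2}$. Let $n_1 = 0$ then inductively choose

$$n_j := \min\left\{ n : n \geq \cdots \, (n_{j-1} + 1) \;\wedge\; \frac{1}{n} \sum_{i=1}^{n} a_i \geq \frac{\epsilon}{2} \right\}.$$

By Lemma 17,

$$\left| \left\{ i \leq n_j : a_i \geq \frac{\epsilon}{4} \right\} \right| \geq n_j \frac{\epsilon}{4}. \tag{26}$$

Therefore

$$\sum_{i=n_j+1}^{n_{j+1}} \frac{a_i}{i} \geq \cdots \tag{27}$$

$$\geq \cdots \tag{28}$$

Equation (27) follows from (26) and because $\frac{1}{i}$ is a decreasing function. (28) follows
from the definition of $n_j$ and algebra. Therefore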

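$$\sum_{i=1}^{\infty} \frac{a_i}{i} = \lim_{k\to\infty} \sum_{j=1}^{k} \sum_{i=n_j+1}^{n_{j+1}} \frac{a_i}{i} \geq \cdots = \infty. \tag{29}$$

Finally, substituting Equation (29) into (25) gives

$$\prod_{i=1}^{\infty} \left( 1 - \frac{a_i}{i} \right) = 0$$

as required. $\Box$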
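A small numerical illustration of the argument may help. It assumes that the quantity bounded in Equation (25) is a product of the form $\prod_i (1 - a_i/i)$, and it uses the constant sequence $a_i = 1/2$ as an example of our own choosing. The helper names are hypothetical.

```python
import math
from typing import Callable


def partial_product(a: Callable[[int], float], n: int) -> float:
    """prod_{i=1}^{n} (1 - a(i)/i) for a sequence with a(i) in [0, 1]."""
    result = 1.0
    for i in range(1, n + 1):
        result *= 1.0 - a(i) / i
    return result


def exp_bound(a: Callable[[int], float], n: int) -> float:
    """exp(-sum_{i=1}^{n} a(i)/i), an upper bound because 1 - x <= exp(-x)."""
    return math.exp(-sum(a(i) / i for i in range(1, n + 1)))


def constant_half(i: int) -> float:
    """Example sequence with average 1/2 (our choice, for illustration)."""
    return 0.5


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**4):
        print(n, partial_product(constant_half, n), exp_bound(constant_half, n))
```

For this sequence the sum $\sum_i a_i/i$ is half the harmonic series, which diverges, so both the bound and the product shrink towards zero as $n$ grows.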

B Table of Notation



Symbol                      Description

$\mathcal{Y}$               Set of possible actions
$\mathcal{O}$               Set of possible observations
$\mathcal{R}$               Set of possible rewards
$\mu, \nu$                  Environments
$y$                         An action.
$x$                         An observation/reward pair
$r$                         A reward
$o$                         An observation
$[\![expr]\!]$              The delta function. $[\![expression]\!] = 1$ if expression is true and 0
                            otherwise.
$\neg b$                    The not function. $\neg 0 = 1$ and $\neg 1 = 0$.
$\pi$                       A policy.
$\chi$                      An infinite binary string. $\chi_k = 1$ if an agent starts exploring at
                            time-step $k$.
$\bar\chi$                  An infinite binary string. $\bar\chi_k = 1$ if an agent is exploring at
                            time-step $k$.
$\dot\chi^{h}$              An infinite binary string. $\dot\chi^{h}_k = 0$ if an agent will not explore
                            for the next $h$ time-steps.
$\alpha$                    An infinite binary string.
$\psi$                      An infinite random binary string sampled from the coin flip measure.
$t, n, i, j, k$             Time indices.
$yx_{<t}$                   A sequence of action/observation/reward sequences. Splits into
                            $y_1 o_1 r_1 y_2 o_2 r_2 \cdots y_{t-1} o_{t-1} r_{t-1}$.
$yx^{\pi}_{<t}$             The sequence of action/reward/observations seen in deterministic
                            environment $\mu$ when playing policy $\pi$.
$V^{\pi}_{\mu}(yx_{<t})$    The value of playing policy $\pi$ in environment $\mu$ starting at history
                            $yx_{<t}$.
$V^{*}_{\mu}(yx_{<t})$      The value of playing the optimum policy $\pi^{*}_{\mu}$ in environment $\mu$
                            starting at history $yx_{<t}$.
$\pi^{*}_{\mu}$             The optimum policy in environment $\mu$.
$H_t(p)$                    The $p$-percentile horizon.
$0^{h+1}$                   A binary string consisting of $h+1$ zeros.